In [1]:
import codecs
import glob
import logging
import os
import re
import scipy
import spacy
import logging
import sys
import string
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import mode
from time import time
from string import punctuation
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem import PorterStemmer
from collections import Counter
from sklearn import ensemble
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from nltk.tokenize import sent_tokenize
from sklearn.model_selection import train_test_split,cross_val_score, KFold, cross_val_predict, GridSearchCV, StratifiedKFold
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import Normalizer, normalize
from sklearn.manifold import TSNE
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.neural_network import MLPClassifier
%matplotlib inline
get_ipython().magic('pylab inline')
The purpose of this challenge is to classify authors using different novels that they have written. Several supervised techniques are used and compared, each with both tf-idf and bag-of-words features, to see which one gives the best results. Regarding the corpus, the authors have been chosen randomly from Project Gutenberg, with seven novels from each author. Although ten novels were initially picked, only seven have been kept for classification because of computing restrictions. The authors used in the code below are Austen, Chesterton, Conan Doyle and Elliot.
In this notebook we will see the following steps:
To run the supervised parts of this challenge, a new virtual machine has been set up to improve computational performance. After initial trials on a machine with RAM increased to 12 GB, the challenge proved too resource intensive, which is why a virtual machine with 8 vCPUs and 30 GB of memory was set up on Google Compute Engine.
Ten novels from four different authors have been retrieved from Project Gutenberg, and a list of all the book files is created.
In [2]:
# Create a list of all of our book files.
book_filenames_austen = sorted(glob.glob("/home/borjaregueral/challengesuper2/austen/*.txt"))
book_filenames_chesterton = sorted(glob.glob("/home/borjaregueral/challengesuper2/chesterton/*.txt"))
book_filenames_conandoyle = sorted(glob.glob("/home/borjaregueral/challengesuper2/conandoyle/*.txt"))
book_filenames_elliot = sorted(glob.glob("/home/borjaregueral/challengesuper2/elliot/*.txt"))
The text is added to the corpus and stored as raw books so that it can be cleansed.
In [3]:
#Read and add the text of each book to corpus_raw.
corpus_raw_austen = u""
for book_filename in book_filenames_austen:
    print("Reading '{0}'...".format(book_filename))
    with codecs.open(book_filename, "r", "utf-8") as book_file:
        corpus_raw_austen += book_file.read()
    print("Corpus is now {0} characters long".format(len(corpus_raw_austen)))
    print()
#Read and add the text of each book to corpus_raw.
corpus_raw_chesterton = u""
for book_filename in book_filenames_chesterton:
    print("Reading '{0}'...".format(book_filename))
    with codecs.open(book_filename, "r", "utf-8") as book_file:
        corpus_raw_chesterton += book_file.read()
    print("Corpus is now {0} characters long".format(len(corpus_raw_chesterton)))
    print()
#Read and add the text of each book to corpus_raw.
corpus_raw_conandoyle = u""
for book_filename in book_filenames_conandoyle:
    print("Reading '{0}'...".format(book_filename))
    with codecs.open(book_filename, "r", "utf-8") as book_file:
        corpus_raw_conandoyle += book_file.read()
    print("Corpus is now {0} characters long".format(len(corpus_raw_conandoyle)))
    print()
#Read and add the text of each book to corpus_raw.
corpus_raw_elliot = u""
for book_filename in book_filenames_elliot:
    print("Reading '{0}'...".format(book_filename))
    with codecs.open(book_filename, "r", "utf-8") as book_file:
        corpus_raw_elliot += book_file.read()
    print("Corpus is now {0} characters long".format(len(corpus_raw_elliot)))
    print()
doc_complete = [corpus_raw_austen, corpus_raw_chesterton, corpus_raw_conandoyle,
                corpus_raw_elliot]
In [4]:
book_file.close()
Before generating the features, and to increase their explanatory power, the text has been cleaned and parsed. The books go through an initial set of cleansing actions before being parsed with spaCy, to reduce the computing effort the parser requires, and are then cleaned again before feature generation.
The initial cleansing has three steps. The first step consists of deleting all references to Project Gutenberg from every book. This prevents words like “Gutenberg” and “Gutenberg Project” from appearing as features and distorting the classification of the authors.
As described below, the cleaning actions remove all references to chapters, digits, double whitespace and references to numbers such as dates and ordinal numbers. This is followed by removing punctuation and common stop words, which would only add noise to the features generated afterwards.
The remaining words, considered to have the most explanatory power for each of the authors' titles, have been lemmatized and stemmed, reducing the computing resources needed by up to 60%. Lemmatization reduces words from the same family to their lemmas; stemming additionally removes prefixes and suffixes. All cleaning operations are carried out so that the remaining sentences are stored in a list of lists.
In [5]:
#Create a set of stopwords in english from nltk
stop = set(stopwords.words('english'))
# Create a set of punctuation marks to exclude them from the text
exclude = set(string.punctuation)
# Call the lemmatizer
lemma = WordNetLemmatizer()
#Define a cleaning function that incorporates the different steps in the pipeline to clean the texts
def clean(doc):
    # Replace double hyphens with a space
    doc = re.sub(r'--', ' ', doc)
    # Remove bracketed notes such as [Illustration]
    doc = re.sub(r"[\[].*?[\]]", "", doc)
    # Remove chapter references and upper-case chapter headings
    doc = re.sub(r'Chapter \d+', '', doc)
    doc = re.sub(r'CHAPTER .*', '', doc)
    # Remove digits and stand-alone numbers
    doc = re.sub(r'[0-9]+', '', doc)
    doc = re.sub(r"^\d+\s|\s\d+\s|\s\d+$", " ", doc)
    # Lowercase, drop stop words, strip punctuation and lemmatize
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized
#Create a list with one cleaned document per author
doc_clean = [clean(doc) for doc in doc_complete]
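To see what the regular-expression steps of the cleaning function do in isolation, the sketch below applies them to a made-up sentence. `strip_markup` and the sample text are hypothetical and introduced only for illustration; stop-word removal, punctuation stripping and lemmatization are left out so that the snippet runs without NLTK data.

```python
import re

def strip_markup(text):
    """Apply only the regex steps of the cleaning function (illustrative helper)."""
    text = re.sub(r'--', ' ', text)           # double hyphen -> space
    text = re.sub(r'\[.*?\]', '', text)       # drop bracketed notes
    text = re.sub(r'Chapter \d+', '', text)   # drop "Chapter 12" style references
    text = re.sub(r'CHAPTER .*', '', text)    # drop upper-case headings up to the line end
    text = re.sub(r'[0-9]+', '', text)        # drop any remaining digits
    return ' '.join(text.split())             # collapse runs of whitespace

# Made-up sample text, not taken from the corpus
sample = "CHAPTER I\nIt was 1811--a wet year [Footnote 3] at Chapter 2 of life."
cleaned = strip_markup(sample)
print(cleaned)
```

Note that the pattern `CHAPTER .*` stops at the end of the line, so only the heading itself is removed and the following text is kept.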
In [6]:
# Parse the cleaned novels
#load spacy for english language as all novels are in english
nlp = spacy.load('en')
#Parse novels one by one to maintain the author tagging
austen_doc = nlp(doc_clean[0])
chesterton_doc = nlp(doc_clean[1])
conandoyle_doc = nlp(doc_clean[2])
elliot_doc = nlp(doc_clean[3])
In [8]:
# Group into sentences.
austen_sents = [[str(sent), "Austen"] for sent in austen_doc.sents]
chesterton_sents = [[str(sent), "Chesterton"] for sent in chesterton_doc.sents]
conandoyle_sents = [[str(sent), "Conandoyle"] for sent in conandoyle_doc.sents]
elliot_sents = [[str(sent), "elliot"] for sent in elliot_doc.sents]
# Combine the sentences from the four authors into one data frame.
names = ['Sentences','Author']
sent = pd.DataFrame(austen_sents + chesterton_sents +
conandoyle_sents +
elliot_sents, columns = names)
#Plot the contribution of each author to the corpus (sentences)
sent.Author.value_counts().plot(kind='bar', grid=False, figsize=(16, 9))
Out[8]:
In [10]:
#Add a numerical column to tag the authors for supervised classification
sent.loc[sent['Author'] == 'Austen', 'Target'] = 0
sent.loc[sent['Author'] == 'Chesterton', 'Target'] = 1
sent.loc[sent['Author'] == 'Conandoyle', 'Target'] = 2
sent.loc[sent['Author'] == 'elliot', 'Target'] = 3
Features using BoW
Texts have been vectorized using bag of words. In this case the algorithm counts the number of times a word appears in a given text. When building the bag-of-words space, n-grams of up to four words are considered and English stop words are removed to reduce noise in the dataset. Given the authors that have been chosen, this method will bias the models towards the authors with longer texts, Elliot and Austen, over Conan Doyle and Chesterton. The total number of features is 52k.
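As a rough illustration of what counting n-grams means, the plain-Python sketch below builds every n-gram up to a maximum length and counts it. The helper `ngrams` and the toy sentence are hypothetical; the real vectorizer additionally tokenizes, removes stop words and applies the document-frequency limits, none of which is reproduced here.

```python
from collections import Counter

def ngrams(tokens, n_max):
    """Return all contiguous n-grams of length 1..n_max as space-joined strings."""
    return [' '.join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

# Toy sentence, already tokenized by whitespace
tokens = "the dog saw the dog".split()

# Count unigrams and bigrams, as a bag of words with n-grams up to 2 would
counts = Counter(ngrams(tokens, 2))
print(counts["the dog"], counts["saw"])

# With n-grams up to 4, a 5-token sentence yields 5 + 4 + 3 + 2 entries
all_grams = ngrams(tokens, 4)
print(len(all_grams))
```

The quick growth in the number of n-grams per sentence is one reason the feature space reaches tens of thousands of columns.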
In [11]:
#Transform into Bag of Words
vec = CountVectorizer(max_df = 0.75 , min_df = 2 , ngram_range = (1,4), stop_words = 'english')
#Build the predictors and the predicted variable applying BoW.
X = vec.fit_transform(sent['Sentences'])
y = sent['Target']
#Split the data set into train and test 70/30
X_train_bow, X_test_bow, y_train_bow, y_test_bow = train_test_split(X,y, test_size=0.30, random_state=1234)
X_train_bow.shape
Out[11]:
Features using Tf-idf
When using tf-idf, the frequency of appearance is normalized, and only terms that appear in at least two documents and in no more than 75% of them are kept. With this method, the raw counts are smoothed to reflect additional properties of a word, such as the amount of information it adds to describe the novel. As with the bag of words, n-grams of up to four words are considered, stop words are removed and sublinear_tf is used. The latter applies a logarithmic scaling to the word count, which is then weighted by how frequently the term appears across documents and within a document.
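The weighting can be sketched in plain Python. The formulas below follow the documented behaviour of scikit-learn's tf-idf with `sublinear_tf=True` and the default smoothed idf, followed by L2 normalization; the helper names `tfidf_weight` and `l2_normalize` are hypothetical and the numbers are made up.

```python
import math

def tfidf_weight(tf, df, n_docs):
    """Weight of one term in one document before normalization."""
    sub_tf = 1.0 + math.log(tf)                     # sublinear tf: 1 + ln(tf)
    idf = math.log((1 + n_docs) / (1 + df)) + 1.0   # smoothed idf
    return sub_tf * idf

def l2_normalize(weights):
    """Scale a document vector to unit Euclidean length (norm='l2')."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

# A term present in every document gets the minimum idf of 1
print(tfidf_weight(tf=1, df=9, n_docs=9))

# A rare term outweighs a common one at the same count
rare = tfidf_weight(tf=1, df=2, n_docs=100)
common = tfidf_weight(tf=1, df=50, n_docs=100)
print(rare > common)

unit = l2_normalize([3.0, 4.0])
print(unit)
```

Because of the logarithm, a term that appears ten times in a sentence weighs only a few times more than one that appears once, rather than ten times more.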
In [12]:
#Transform into Tf-idf considering the relative frequency
vect = TfidfVectorizer(norm = 'l2', max_df = 0.75 , min_df = 2 , ngram_range = (1,4), stop_words = 'english',
use_idf = True, sublinear_tf = True)
#Build the predictors and the predicted variable applying Tf-idf.
X_tfidf = vect.fit_transform(sent['Sentences'])
y_tfidf = sent['Target']
#Split the data set into train and test 70/30
X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(X_tfidf,y_tfidf, test_size=0.30, random_state=1234)
Five folds have been defined and will be used to tune and evaluate the models.
In [13]:
#KFold for cross validation analysis
kf = KFold(n_splits=5, shuffle=True, random_state=123)
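A minimal sketch of what a shuffled five-fold split does with the sample indices is shown below. `kfold_indices` is a hypothetical helper; it mirrors the fold-size rule (the first `n_samples % n_splits` folds get one extra sample) but its shuffling will not reproduce the exact order produced by scikit-learn for the same seed.

```python
import random

def kfold_indices(n_samples, n_splits, seed):
    """Shuffle the indices and cut them into n_splits nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for k in range(n_splits):
        size = base + (1 if k < extra else 0)  # first `extra` folds are one larger
        folds.append(idx[start:start + size])
        start += size
    return folds

folds = kfold_indices(n_samples=12, n_splits=5, seed=123)
for k, test_idx in enumerate(folds):
    # Each fold is used once as validation data; the rest is training data
    train_idx = [i for f in folds if f is not test_idx for i in f]
    print(k, len(train_idx), len(test_idx))
```

Every sample lands in exactly one validation fold, so each score in the cross validation is computed on data the model did not see while fitting.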
All models have been run using the features obtained through bag of words and tf-idf. The results are compared to see which gives the better overall accuracy, which is used as the scoring function. In all cases cross validation over five folds is applied.
Bag of Words
A Logistic Regression classifier is trained using the features obtained through bag of words. Additionally, the parameters are tuned using grid search. As the length of the texts, and therefore the number of features per author, is not balanced, the class weight is set so that it accounts for unbalanced classes.
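The effect of `class_weight='balanced'` can be sketched as follows. The formula, `n_samples / (n_classes * count)`, is the one scikit-learn documents for this setting; the helper name `balanced_class_weights` and the toy labels are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency in the labels."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

# Toy labels: class 0 has twice as many sentences as classes 1 and 2
labels = [0] * 6 + [1] * 3 + [2] * 3
weights = balanced_class_weights(labels)
print(weights)
```

Errors on sentences from the smaller classes are penalized more heavily, which counteracts the bias towards the authors with longer texts.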
In [12]:
# Initialize and fit the model.
log_reg_bow = LogisticRegression(class_weight='balanced', penalty = 'l2', multi_class= 'multinomial', max_iter = 1000)
#Tune parameters: C parameter
c_param = [ 0.1, 0.5, 1 ]
#Tune the solver among those that support the multinomial loss
solver_param = ['newton-cg', 'lbfgs']
parameters = {'C': c_param, 'solver': solver_param}
#Fit parameters
log_reg_tuned_bow = GridSearchCV(log_reg_bow, param_grid=parameters, n_jobs = -1, cv=kf, verbose = 1)
#Fit the tuned classifier in the training space
log_reg_tuned_bow.fit(X_train_bow, y_train_bow)
#Print the best parameters
print(('Best parameters logistic regression BoW:\n {}\n').format(log_reg_tuned_bow.best_params_))
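To make the cost of the search concrete, the sketch below expands the same parameter grid into its candidate settings and counts the cross-validation fits. `expand_grid` is a hypothetical helper, not the library routine; it only illustrates the Cartesian product that a grid search walks through.

```python
from itertools import product

def expand_grid(grid):
    """Return one dict per combination of parameter values."""
    keys = sorted(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]

grid = {'C': [0.1, 0.5, 1], 'solver': ['newton-cg', 'lbfgs']}
candidates = expand_grid(grid)

n_folds = 5
n_fits = len(candidates) * n_folds  # one fit per candidate per fold
print(len(candidates), n_fits)
```

With the default `refit=True`, one further fit of the best candidate on the whole training set follows the cross-validation fits.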
After the parameters are tuned, the grid search is refit on the test dataset. As a measure of the computing effort, it requires 3.6 min to fit the test set.
In [13]:
#Once the model has been trained test it on the test dataset
log_reg_tuned_bow.fit(X_test_bow, y_test_bow)
# Predict on test set
predtest_y_bow = log_reg_tuned_bow.predict(X_test_bow)
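The cell above calls `fit` on the test split and then predicts on that same split, so the predictions are in-sample. The toy sketch below shows why in-sample scores tend to run high, using a hypothetical one-nearest-neighbour memorizer on made-up one-dimensional data; it is not the model used in this notebook.

```python
def nearest_label(train, x):
    """Return the label of the training point closest to x."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def score(train, data):
    """Fraction of points in `data` whose nearest training label is correct."""
    hits = sum(nearest_label(train, x) == y for x, y in data)
    return hits / len(data)

# Made-up (feature, label) pairs
train = [(0.0, 'a'), (1.0, 'a'), (2.0, 'b'), (3.0, 'b')]
held_out = [(0.4, 'a'), (1.6, 'a'), (2.4, 'b'), (1.4, 'b')]

in_sample = score(train, train)         # every point is its own nearest neighbour
out_of_sample = score(train, held_out)  # unseen points near the class boundary are missed
print(in_sample, out_of_sample)
```

For the same reason, a classification report computed on the data a model was just fit on is best read as a measure of fit, while the cross-validated accuracy estimates performance on unseen sentences.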
The model is evaluated on the test set. The solver has been chosen among the options that support multiclass classification. As can be seen in the classification report, the model shows overfitting: precision and recall are close to one in all classes except one, which is the class that reduces the overall accuracy of the model. Note that these predictions are made on the same split the search was just refit on, so the report measures in-sample fit; the cross-validated accuracy printed last is the out-of-sample figure.
In [14]:
#Evaluation of the model (testing)
target_names = ['0.0', '1.0', '2.0', '3.0']
print(('Classification Report BoW: \n {}')
.format(classification_report(y_test_bow, predtest_y_bow,
target_names=target_names)))
confusion_bow = confusion_matrix(y_test_bow, predtest_y_bow)
print(('Confusion Matrix BoW: \n\n {}\n'
).format(confusion_bow))
print(('Logistic Regression set accuracy BoW: {0:.2f} % \n'
).format(cross_val_score(log_reg_tuned_bow, X_test_bow, y_test_bow,cv=kf).mean()*100
))
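The numbers in the classification report can be derived from the confusion matrix. The sketch below does so for a made-up two-class matrix, using the convention that rows are true classes and columns are predicted classes; `per_class_scores` and `overall_accuracy` are hypothetical helper names.

```python
def per_class_scores(confusion):
    """Return (precision, recall) per class from a square confusion matrix."""
    n = len(confusion)
    scores = []
    for c in range(n):
        tp = confusion[c][c]
        predicted = sum(confusion[r][c] for r in range(n))  # column sum
        actual = sum(confusion[c])                          # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        scores.append((precision, recall))
    return scores

def overall_accuracy(confusion):
    """Share of samples on the diagonal."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[i][i] for i in range(len(confusion))) / total

# Made-up matrix: rows = true class, columns = predicted class
toy = [[8, 2],
       [1, 9]]
scores = per_class_scores(toy)
acc = overall_accuracy(toy)
print(scores, acc)
```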
The logistic regression model is computationally efficient, as it fits the dataset with over 50k features in less than two minutes, making it a strong candidate to move into production. The overall accuracy is nearly 77%, which is roughly five percentage points more than in the challenge for this unit. The accuracy is higher than the one obtained by unsupervised methods using clustering, and it is much more stable. In this case, introducing the test set, unseen by the model, does not provoke unstable classifications.
TF-idf
A Logistic Regression classifier is trained using the features obtained through tf-idf. Additionally, the parameters are tuned using grid search. As the length of the texts, and therefore the number of features per author, is not balanced, the class weight is set so that it accounts for unbalanced classes. In this case the parameter C of the model is higher than the one used with the bag of words.
In [15]:
# Initialize and fit the model.
log_reg_tfidf = LogisticRegression(class_weight='balanced', penalty = 'l2', multi_class= 'multinomial', max_iter = 600)
#Tune parameters
#C parameter
c_param = [ 0.1, 0.5, 1 ]
#Tune the solver among those that support the multinomial loss
solver_param = ['newton-cg','lbfgs']
parameters = {'C': c_param, 'solver': solver_param}
#Fit parameters
log_reg_tuned_tfidf = GridSearchCV(log_reg_tfidf, param_grid=parameters, n_jobs = -1, cv=kf, verbose = 1)
#Fit the tuned classifier in the training space
log_reg_tuned_tfidf.fit(X_train_tfidf, y_train_tfidf)
#Print the best parameters
print(('Best parameters logistic regression Tfidf: \n{}\n'
).format(log_reg_tuned_tfidf.best_params_))
After the parameters are tuned, the grid search is refit on the test dataset. As a measure of the computing effort, it requires less than one minute to fit the test set.
In [16]:
#Once the model has been trained test it on the test dataset
log_reg_tuned_tfidf.fit(X_test_tfidf, y_test_tfidf)
# Predict on test set
predtest_y_tfidf = log_reg_tuned_tfidf.predict(X_test_tfidf)
The model is evaluated on the test set. The solver has been chosen among the options that support multiclass classification. As can be seen in the classification report, the model shows overfitting: precision and recall are close to one in all classes except one, which is the class that reduces the overall accuracy of the model.
In [17]:
#Evaluation of the model (testing)
target_names = ['0.0', '1.0', '2.0', '3.0']
print(('Classification Report Tf-idf: \n {}')
.format(classification_report(y_test_tfidf, predtest_y_tfidf,
target_names=target_names)))
confusion_tfidf = confusion_matrix(y_test_tfidf, predtest_y_tfidf)
print(('Confusion Matrix Tf-idf: \n\n {}\n'
).format(confusion_tfidf))
print(('Logistic Regression set accuracy Tf-idf: {0:.2f} % \n'
).format(cross_val_score(log_reg_tuned_tfidf, X_test_tfidf, y_test_tfidf,cv=kf).mean()*100
))
The logistic regression model is computationally efficient, as it fits the dataset with over 80k features in less than two minutes, making it a strong candidate to move into production. The overall accuracy is nearly 80%, which is roughly five percentage points more than in the challenge for this unit. The accuracy is higher than the one obtained by unsupervised methods using clustering, and it is much more stable. In this case, introducing the test set, unseen by the model, does not provoke unstable classifications.
Bernoulli Classifier
Bag of Words
A Bernoulli classifier has been tuned and trained on the features obtained through bag of words. The simplicity of the model, added to the good classification results, makes it a good candidate to move into production. The time required to train it is lower than the time required to train the logistic regression model.
In [19]:
# Initialize and fit the model.
naive_bayes_bernoulli_bow = BernoulliNB()
#Tune hyperparameters
#Create range of values to fit parameters
alpha = [0.0001, 0.001, 0.01]
parameters = {'alpha': alpha}
#Fit parameters using gridsearch
naive_bayes_bernoulli_tuned_bow = GridSearchCV(naive_bayes_bernoulli_bow, n_jobs = -1, param_grid=parameters, cv=kf, verbose = 1)
#Fit the tuned classifier in the training space
naive_bayes_bernoulli_tuned_bow.fit(X_train_bow, y_train_bow)
#Print the best parameters
print(('Best parameters Naive-Bayes Bernoulli BoW: \n{}\n').format(naive_bayes_bernoulli_tuned_bow.best_params_))
After several runs with different extremes for the alpha parameter, the value chosen is always the one closest to zero. This means that very little additive smoothing is required. The model is fit within seconds, which makes it a strong candidate (the best one from a computational and speed standpoint) to move into production.
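The role of alpha can be sketched with the smoothed estimate a Bernoulli Naive Bayes model uses for the probability that a feature is present in a class, `(N_yi + alpha) / (N_y + 2 * alpha)`. The helper name `bernoulli_feature_prob` and the counts are made up for illustration.

```python
def bernoulli_feature_prob(docs_with_feature, docs_in_class, alpha):
    """Smoothed probability that a feature is present in a document of the class."""
    return (docs_with_feature + alpha) / (docs_in_class + 2 * alpha)

# An n-gram never seen for this author in 100 training sentences
for alpha in (0.0001, 0.01, 1.0):
    print(alpha, bernoulli_feature_prob(0, 100, alpha))

tiny = bernoulli_feature_prob(0, 100, 0.0001)
laplace = bernoulli_feature_prob(0, 100, 1.0)
```

A small alpha keeps the probability of unseen n-grams barely above zero, so a single unseen n-gram counts strongly against an author; larger values soften that penalty.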
In [20]:
#Once the model has been trained test it on the test dataset
naive_bayes_bernoulli_tuned_bow.fit(X_test_bow, y_test_bow)
# Predict on test set
predtest_y_bow = naive_bayes_bernoulli_tuned_bow.predict(X_test_bow)
The model is evaluated using cross validation and five folds. As with logistic regression, the model shows overfitting, as can be seen from the classification report, where both precision and recall are one.
In [21]:
#Evaluation of the model (testing)
target_names = ['0.0', '1.0', '2.0', '3.0']
print(('Classification Report BoW: \n {}\n').format(
classification_report(y_test_bow, predtest_y_bow,
target_names=target_names)))
confusion_bow = confusion_matrix(y_test_bow, predtest_y_bow)
print(('Confusion Matrix BoW: \n\n {}\n\n').format(confusion_bow))
print(('Bernoulli Classifier set accuracy BoW: {0:.2f} %\n').format(cross_val_score(naive_bayes_bernoulli_tuned_bow,
X_test_bow,
y_test_bow,cv=kf).mean()*100))
The overall accuracy of the model is slightly lower than the accuracy obtained with the logistic regression classifier. However, the time required to fit the model is around one tenth of the time required for the logistic regression, with both models showing overfitting. Hence, weighing overall accuracy against fitting time, this is the best model so far, with a very small loss of accuracy, scoring 81.75%.
Tf-idf
A Bernoulli classifier has been tuned and trained on the features obtained through tf-idf. The simplicity of the model, added to the good classification results, makes it a good candidate to move into production. The time required to train it is lower than the time required to train the logistic regression model.
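One point worth keeping in mind when comparing the two feature sets: `BernoulliNB` binarizes its input (its `binarize` threshold defaults to 0.0), so it only sees whether an n-gram is present. The sketch below uses a hypothetical `binarize` helper and made-up rows to show that a count row and a tf-idf row with the same non-zero pattern become identical.

```python
def binarize(row, threshold=0.0):
    """Map every value above the threshold to 1 and everything else to 0."""
    return [1 if value > threshold else 0 for value in row]

# Made-up feature rows for the same sentence
bow_row = [0, 3, 1, 0]              # raw counts
tfidf_row = [0.0, 0.71, 0.25, 0.0]  # weighted, but same non-zero positions

print(binarize(bow_row))
print(binarize(tfidf_row))
```

Under this assumption, differences between the bag-of-words and tf-idf results for this classifier would mostly come from the different alpha grids rather than from the features themselves.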
In [22]:
# Initialize and fit the model.
naive_bayes_bernoulli_tfidf = BernoulliNB()
#Tune hyperparameters
#Create range of values to fit parameters
alpha = [0.001, 0.01,0.1]
parameters = {'alpha': alpha}
#Fit parameters using gridsearch
naive_bayes_bernoulli_tuned_tfidf = GridSearchCV(naive_bayes_bernoulli_tfidf,
n_jobs = -1,
param_grid=parameters,
cv=kf, verbose = 1)
#Fit the tuned classifier in the training space
naive_bayes_bernoulli_tuned_tfidf.fit(X_train_tfidf, y_train_tfidf)
#Print the best parameters
print(('Best parameters Naive-Bayes Bernoulli Tfidf: \n{}\n').format(naive_bayes_bernoulli_tuned_tfidf.best_params_))
After several runs with different extremes for the alpha parameter, the value chosen is always the one closest to zero. This means that very little additive smoothing is required. The model is fit within seconds, which makes it a strong candidate (the best one from a computational and speed standpoint) to move into production.
In [23]:
#Once the model has been trained test it on the test dataset
naive_bayes_bernoulli_tuned_tfidf.fit(X_test_tfidf, y_test_tfidf)
# Predict on test set
predtest_y_tfidf = naive_bayes_bernoulli_tuned_tfidf.predict(X_test_tfidf)
The model is evaluated using cross validation and five folds. As with logistic regression, the model shows overfitting, as can be seen from the classification report, where both precision and recall are one.
In [24]:
#Evaluation of the model (testing)
target_names = ['0.0', '1.0', '2.0', '3.0']
print(('Classification Report Tfidf: \n {}').format(classification_report(y_test_tfidf, predtest_y_tfidf,
target_names=target_names)))
confusion_tfidf = confusion_matrix(y_test_tfidf, predtest_y_tfidf)
print(('Confusion Matrix Tf-idf: \n\n {}\n').format(confusion_tfidf))
print(('Bernoulli Classifier Tf-Idf set accuracy Tf-idf: {0:.2f} % \n').format(cross_val_score(naive_bayes_bernoulli_tuned_tfidf,
X_test_tfidf,
y_test_tfidf,
cv=kf).mean()*100))
The overall accuracy of the model (81.58%) is slightly higher than the accuracy obtained with the logistic regression classifier. The time required to fit the model is around one tenth of the time required for the logistic regression, with both models showing overfitting. In this case a single class shows the lowest precision and determines the lower overall accuracy. Hence, weighing overall accuracy against fitting time, this is the best model so far, with a very small loss of accuracy.
Multinomial Classifier
BoW
A multinomial classifier is trained on the features obtained using bag of words and evaluated on the holdout. As with the previous Naive Bayes classifier, alpha always takes the value closest to zero, so almost no additive smoothing is used in this classifier. From a computational effort standpoint, as in the previous case, this is the model that requires the least time to fit, making it a strong candidate to move into production.
In [25]:
# Initialize and fit the model.
naive_bayes_multinomial_bow = MultinomialNB()
#Tune hyperparameters
#Create range of values to fit parameters
alpha = [0.01,0.1,0.5]
parameters = {'alpha': alpha}
#Fit parameters using gridsearch
naive_bayes_multinomial_tuned_bow = GridSearchCV(naive_bayes_multinomial_bow,
n_jobs = -1,
param_grid=parameters,
cv=kf, verbose = 1)
#Fit the tuned classifier in the training space
naive_bayes_multinomial_tuned_bow.fit(X_train_bow, y_train_bow)
#Print the best parameters
print(('Best parameters Naive-Bayes Multinomial BoW:\n {}\n').format(
naive_bayes_multinomial_tuned_bow.best_params_))
The value of alpha is in all trials the one closest to zero, so the additive smoothing is loose. In this case the time required for fitting is less than one minute. The model is then evaluated on the test set. For that, the first step is to fit the test holdout of the dataset.
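For the multinomial model, additive smoothing enters the estimate of each feature probability as `(N_yi + alpha) / (N_y + alpha * n_features)`. The sketch below evaluates it on made-up counts; `multinomial_feature_prob` is a hypothetical helper name.

```python
def multinomial_feature_prob(count, total, alpha, n_features):
    """Smoothed probability of a feature given the class."""
    return (count + alpha) / (total + alpha * n_features)

# Made-up counts of three features within one class
counts = [3, 0, 1]
total = sum(counts)

for alpha in (0.01, 0.5):
    probs = [multinomial_feature_prob(c, total, alpha, len(counts)) for c in counts]
    print(alpha, probs)

probs_half = [multinomial_feature_prob(c, total, 0.5, len(counts)) for c in counts]
unseen_small = multinomial_feature_prob(0, total, 0.01, len(counts))
unseen_large = multinomial_feature_prob(0, total, 0.5, len(counts))
```

The probabilities still sum to one for any alpha; a value close to zero simply leaves very little probability mass for features that were never seen with that author.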
In [26]:
#Once the model has been trained test it on the test dataset
naive_bayes_multinomial_tuned_bow.fit(X_test_bow, y_test_bow)
# Predict on test set
predtest_y_bow = naive_bayes_multinomial_tuned_bow.predict(X_test_bow)
The model shows overfitting, and the accuracy is slightly higher than in the previous case, about 3% more. The confusion matrix shows a lower number of false positives and negatives for all categories; taking into account that the categories differ in size, results are consistent across all of them.
In [27]:
#Evaluation of the model (testing)
target_names = ['0.0', '1.0', '2.0', '3.0']
print(('Classification Report BoW: \n {}\n').format(
classification_report(y_test_bow, predtest_y_bow,
target_names=target_names)))
confusion_bow = confusion_matrix(y_test_bow, predtest_y_bow)
print((
'Confusion Matrix BoW: \n\n {}\n\n').format(confusion_bow))
print((
'Multinomial Classifier set accuracy BoW: {0:.2f} %\n'
).format(cross_val_score(naive_bayes_multinomial_tuned_bow, X_test_bow, y_test_bow,cv=kf).mean()*100))
The time required to fit the model is lower than in any other case, while the accuracy is higher. The accuracy is close to 84.12%, while the classification report shows values close to one, indicating overfitting. Hence, of the classifiers evaluated so far, this is the one with the best results from both an accuracy and a computational effort perspective. It is the best candidate to move into production for the moment.
Tf-idf
A multinomial classifier is trained on the features obtained using tf-idf and evaluated on the holdout. As with the previous Naive Bayes classifier, alpha always takes the value closest to zero, so almost no additive smoothing is used in this classifier. From a computational effort standpoint, as in the previous case, this is the model that requires the least time to fit, making it a strong candidate to move into production.
In [28]:
# Initialize and fit the model.
naive_bayes_multinomial_tfidf = MultinomialNB()
#Tune hyperparameters
#Create range of values to fit parameters
alpha = [0.01,0.1,0.5,1]
parameters = {'alpha': alpha}
#Fit parameters using gridsearch
naive_bayes_multinomial_tuned_tfidf = GridSearchCV(naive_bayes_multinomial_tfidf,
n_jobs = -1,
param_grid=parameters,
cv=kf, verbose = 1)
#Fit the tuned classifier in the training space
naive_bayes_multinomial_tuned_tfidf.fit(X_train_tfidf, y_train_tfidf)
#Print the best parameters
print(('Best parameters Naive-Bayes Multinomial Tfidf:\n {}\n').format(
naive_bayes_multinomial_tuned_tfidf.best_params_))
The value of alpha is in all trials the one closest to zero, so the additive smoothing is loose. In this case the time required for fitting is less than one minute. The model is then evaluated on the test set. For that, the first step is to fit the test holdout of the dataset.
In [29]:
#Once the model has been trained test it on the test dataset
naive_bayes_multinomial_tuned_tfidf.fit(X_test_tfidf, y_test_tfidf)
# Predict on test set
predtest_y_tfidf = naive_bayes_multinomial_tuned_tfidf.predict(X_test_tfidf)
The model shows overfitting, and the accuracy is slightly higher than in the previous case, about 3% more. The confusion matrix shows a lower number of false positives and negatives for all categories; taking into account that the categories differ in size, results are consistent across all of them.
In [30]:
#Evaluation of the model (testing)
target_names = ['0.0', '1.0', '2.0', '3.0']
print(('Classification Report tfidf: \n {}').format(classification_report(y_test_tfidf,
predtest_y_tfidf,
target_names=target_names)))
confusion_tfidf = confusion_matrix(y_test_tfidf, predtest_y_tfidf)
print(('Confusion Matrix Tf-idf: \n\n {}\n').format(confusion_tfidf))
print(('Multinomial Classifier set accuracy Tf-idf: {0:.2f} % \n').format(cross_val_score(naive_bayes_multinomial_tuned_tfidf,
X_test_tfidf,
y_test_tfidf,
cv=kf).mean()*100))
The time required to fit the model is lower than in any other case, while the accuracy is higher. The accuracy is close to 83.67%, while the classification report shows values close to one, indicating overfitting. Hence, of the classifiers evaluated so far, this is the one with the best results from both an accuracy and a computational effort perspective. It is the best candidate to move into production for the moment.
Bag of Words
The KNN classifier has been fit using bag of words. During the grid search, five neighbors have been selected as the optimum number of neighbors when using bag of words.
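Since the classifier is created with `weights = 'distance'`, closer neighbors count more than distant ones. The sketch below illustrates inverse-distance voting on made-up neighbors; `weighted_vote` is a hypothetical helper, and its handling of exact matches (the first zero-distance neighbor wins) is a simplification of what the library does.

```python
from collections import defaultdict

def weighted_vote(neighbors):
    """Pick a label from (distance, label) pairs, weighting each vote by 1/distance."""
    votes = defaultdict(float)
    for distance, label in neighbors:
        if distance == 0:
            return label  # simplification: an exact match decides immediately
        votes[label] += 1.0 / distance
    return max(votes, key=votes.get)

# Made-up neighbors: one close Austen sentence, two distant Elliot sentences
neighbors = [(0.5, 'Austen'), (2.0, 'Elliot'), (2.0, 'Elliot')]
winner = weighted_vote(neighbors)
print(winner)
```

A plain majority vote would pick Elliot here; the distance weighting lets the single close neighbor (weight 2.0) outvote the two distant ones (weight 0.5 each).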
In [31]:
# Initialize and fit the model.
KNN_bow = KNeighborsClassifier(weights = 'distance')
#Tune hyperparameters
#Create range of values to fit parameters
neighbors = [3, 5, 7,9]
#Fit parameters
parameters = {'n_neighbors': neighbors}
#Fit parameters using gridsearch
KNN_tuned_bow = GridSearchCV(KNN_bow, param_grid=parameters, n_jobs = -1, cv=kf, verbose = 1)
#Fit the tuned classifier in the training space
KNN_tuned_bow.fit(X_train_bow, y_train_bow)
#Print the best parameters
print(('Best parameters KNN BoW:\n {}\n').format(
KNN_tuned_bow.best_params_))
Once the model has been tuned, it is fit on the test holdout.
In [32]:
#Once the model has been trained test it on the test dataset
KNN_tuned_bow.fit(X_test_bow, y_test_bow)
# Predict on test set
predtest_y_bow = KNN_tuned_bow.predict(X_test_bow)
The evaluation of the model is done using the classification report, confusion matrix and overall accuracy. In this case KNN works worse than other models as it does not have enough data. From the classification report it can be seen that the model is not overfitting, with precision and recall that are high but not equal to one. Author two is the one with the worst results.
In [33]:
#Evaluation of the model (testing)
target_names = ['0.0', '1.0', '2.0', '3.0']
print(('Classification Report BoW: \n {}\n').format(
classification_report(y_test_bow, predtest_y_bow,
target_names=target_names)))
confusion_bow = confusion_matrix(y_test_bow, predtest_y_bow)
print((
'Confusion Matrix BoW: \n\n {}\n\n').format(confusion_bow))
print((
'KNN accuracy BoW: {0:.2f} %\n'
).format(cross_val_score(KNN_tuned_bow, X_test_bow, y_test_bow,cv=kf).mean()*100))
The model scores well below the accuracy normally achieved with KNN. One of the reasons is the amount of data used to fit the model.
Tf-idf
The model is fit on the training set using the features obtained using tf-idf. In this case the tuning gives a lower value, with the number of neighbors equal to three, as the features have already been smoothed.
In [14]:
# Initialize and fit the model.
KNN_tfidf = KNeighborsClassifier(weights = 'distance')
#Tune hyperparameters
#Create range of values to fit parameters
neighbors = [3, 5, 7,9]
#Fit parameters
parameters = {'n_neighbors': neighbors}
#Fit parameters using gridsearch
KNN_tuned_tfidf = GridSearchCV(KNN_tfidf,
param_grid=parameters,
n_jobs = -1,
cv=kf,
verbose = 1)
#Fit the tuned classifier in the training space
KNN_tuned_tfidf.fit(X_train_tfidf, y_train_tfidf)
#Print the best parameters
print(('Best parameters KNN Tfidf:\n {}\n').format(KNN_tuned_tfidf.best_params_))
Once the parameters are tuned the model is fit on the test set.
In [15]:
#Once the model has been trained test it on the test dataset
KNN_tuned_tfidf.fit(X_test_tfidf, y_test_tfidf)
# Predict on test set
predtest_y_tfidf = KNN_tuned_tfidf.predict(X_test_tfidf)
In this case, the accuracy obtained with tf-idf is not very different from the accuracy obtained with the bag of words. Better results would be obtained if more data were used to run the model.
In [16]:
#Evaluation of the model (testing)
target_names = ['0.0', '1.0', '2.0', '3.0']
print(('Classification Report Tfidf: \n {}\n').format(
classification_report(y_test_tfidf, predtest_y_tfidf,
target_names=target_names)))
confusion_tfidf = confusion_matrix(y_test_tfidf, predtest_y_tfidf)
print((
'Confusion Matrix Tfidf: \n\n {}\n\n').format(confusion_tfidf))
print((
'KNN accuracy Tfidf: {0:.2f} %\n'
).format(cross_val_score(KNN_tuned_tfidf, X_test_tfidf, y_test_tfidf,cv=kf).mean()*100))
Regarding the time used by this model, it is unexpectedly low as it runs over a small dataset. This is the reason why the values obtained are so low when compared to the results obtained through the bag of words.
Bag of Words
The SGD classifier is fit on the training set. The SGD classifier fits regularized linear models with stochastic gradient descent learning. The gradient of the loss is estimated one sample at a time, and the model is updated along the way with a decreasing learning rate. This classifier can work with sparse data such as the one obtained from bag of words. Of the types of penalty the algorithm accepts, it uses L2 instead of the combination of L1 and L2 implemented through Elastic Net.
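The two penalties offered to the grid search can be compared with a small sketch. The formulas follow the regularization terms documented for scikit-learn's SGD estimators, where the elastic net mixes the two norms through `l1_ratio` (0.15 by default) and the whole term is scaled by alpha; the helper names and the weight vector are hypothetical.

```python
def l2_penalty(weights):
    """Half the squared Euclidean norm of the weights."""
    return 0.5 * sum(w * w for w in weights)

def l1_penalty(weights):
    """Sum of absolute weights."""
    return sum(abs(w) for w in weights)

def elastic_net_penalty(weights, l1_ratio=0.15):
    """Convex combination of the L2 and L1 penalties."""
    return (1.0 - l1_ratio) * l2_penalty(weights) + l1_ratio * l1_penalty(weights)

# Made-up weight vector
w = [3.0, -4.0]
print(l2_penalty(w), l1_penalty(w), elastic_net_penalty(w))
```

Setting `l1_ratio` to 0 recovers the pure L2 penalty and setting it to 1 recovers pure L1, which tends to push more weights to exactly zero.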
In [37]:
# Initialize and fit the model.
SGD_bow = SGDClassifier(class_weight = 'balanced', max_iter=1000)
#Tune hyperparameters
#Create range of values to fit parameters
loss_param = ['hinge', 'squared_hinge']
penalty_param = ['l2', 'elasticnet']
alpha_param = [0.1, 1, 10, 100]
#Fit parameters
parameters = {'loss': loss_param,
'penalty': penalty_param,
'alpha': alpha_param}
#Fit parameters using gridsearch
SGD_tuned_bow = GridSearchCV(SGD_bow, param_grid=parameters, n_jobs = -1, cv=kf, verbose = 1)
#Fit the tuned classifier on the training set
SGD_tuned_bow.fit(X_train_bow, y_train_bow)
#Print the best parameters
print(('Best parameters SGD BoW:\n {}\n').format(
SGD_tuned_bow.best_params_))
The parameters show that the regularization (alpha) continues to be loose as a first option, as the model is a linear classifier trained with gradient descent. Regarding the loss, the hinge loss is selected: the true classification (zero-one) loss, which is hard to optimize directly, is replaced by a convex upper bound, which forces convergence. The time required is significantly higher than in the case of the Naive Bayes classifiers.
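The upper-bound relationship between the hinge loss and the misclassification (zero-one) loss can be checked numerically. The sketch below evaluates both losses over a small range of margins m = y * f(x); the grid of margins is an arbitrary choice made for illustration.

```python
import numpy as np

# Margins m = y * f(x): negative means misclassified, positive means correct
margins = np.linspace(-2.0, 2.0, 9)

# Zero-one loss: 1 for a misclassified (or boundary) sample, 0 otherwise
zero_one = (margins <= 0).astype(float)

# Hinge loss: max(0, 1 - m), convex in m
hinge = np.maximum(0.0, 1.0 - margins)

for m, z, h in zip(margins, zero_one, hinge):
    print('margin={:+.1f}  zero-one={:.0f}  hinge={:.1f}'.format(m, z, h))

# The hinge loss is never below the zero-one loss
hinge_bounds_zero_one = bool(np.all(hinge >= zero_one))
print('Hinge is an upper bound: {}'.format(hinge_bounds_zero_one))
```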
In [38]:
#Refit the tuned grid search on the test dataset (the predictions below are
#therefore made on data the model has already seen)
SGD_tuned_bow.fit(X_test_bow, y_test_bow)
# Predict on test set
predtest_y_bow = SGD_tuned_bow.predict(X_test_bow)
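This model presents overfitting: precision and recall are equal to one for every class. Because the grid search is refit on the test set before predicting, these scores are measured on data the model has already seen and do not reflect its performance on new data. The confusion matrix shows a low number of false negatives and false positives per class, with the classes more or less evenly represented except for class three.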
In [39]:
#Evaluation of the model (testing)
target_names = ['0.0', '1.0', '2.0', '3.0']
print(('Classification Report BoW: \n {}\n').format(
classification_report(y_test_bow, predtest_y_bow,
target_names=target_names)))
confusion_bow = confusion_matrix(y_test_bow, predtest_y_bow)
print((
'Confusion Matrix BoW: \n\n {}\n\n').format(confusion_bow))
print((
'SGD accuracy BoW: {0:.2f} %\n'
).format(cross_val_score(SGD_tuned_bow, X_test_bow, y_test_bow,cv=kf).mean()*100))
In this case, the overall accuracy is 72.57%, very similar to the overall accuracy obtained using the multinomial classifier. The computational effort required by this model to achieve this accuracy is much higher than in the case of the multinomial classifier. Hence, this model would not be recommended for production despite its high accuracy.
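A comparison of computational effort like the one above can be backed by a simple wall-clock measurement. The sketch below times a single `fit` call for each classifier on random count data. The data, the helper `timed_fit` and the dictionary `fit_seconds` are assumptions for illustration, and the timings will vary by machine and say nothing about the notebook's corpus.

```python
from time import perf_counter

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB

# Random non-negative counts standing in for a bag-of-words matrix (assumption)
rng = np.random.RandomState(0)
X_counts = rng.poisson(1.0, size=(500, 50))
y_demo = rng.randint(0, 4, size=500)

def timed_fit(model, X, y):
    """Return the wall-clock seconds spent in model.fit(X, y)."""
    start = perf_counter()
    model.fit(X, y)
    return perf_counter() - start

fit_seconds = {
    'MultinomialNB': timed_fit(MultinomialNB(), X_counts, y_demo),
    'SGDClassifier': timed_fit(
        SGDClassifier(max_iter=1000, random_state=0), X_counts, y_demo),
}
for name, seconds in fit_seconds.items():
    print('{}: {:.4f} s'.format(name, seconds))
```

For a grid search, the same measurement would wrap the `GridSearchCV.fit` call, which multiplies the cost by the number of parameter combinations and folds.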
Tf-idf
The SGD classifier implements regularized linear models with stochastic gradient descent learning: the gradient of the loss is estimated one sample at a time, and the model is updated along the way with a decreasing learning rate. This classifier can work with sparse data such as the one obtained from tfidf. Of the penalties the algorithm accepts, it uses L2 rather than the combination of L1 and L2 implemented through Elastic Net.
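To illustrate the point about sparse input, the sketch below builds a tf-idf matrix from a handful of invented sentences and passes the resulting SciPy sparse matrix straight to `SGDClassifier`. The sentences and labels are hypothetical and only serve to show that no dense conversion is needed.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

# Invented toy sentences with two hypothetical "authors" (labels 0 and 1)
sentences = [
    'the ship sailed across the grey sea',
    'the captain watched the sea from the deck',
    'waves broke against the ship at night',
    'the sailors sang on the deck',
    'the garden was full of roses and light',
    'she walked through the garden at dawn',
    'roses grew along the garden wall',
    'the light fell softly on the wall',
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# TfidfVectorizer returns a SciPy sparse matrix
vectorizer = TfidfVectorizer()
X_sparse = vectorizer.fit_transform(sentences)
is_sparse = sparse.issparse(X_sparse)

# SGDClassifier accepts the sparse matrix directly
clf = SGDClassifier(loss='hinge', penalty='l2', max_iter=1000, random_state=0)
clf.fit(X_sparse, labels)

print(type(X_sparse).__name__, X_sparse.shape, is_sparse)
```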
In [40]:
# Initialize and fit the model.
SGD_tfidf = SGDClassifier(class_weight = 'balanced', max_iter=1000)
#Tune hyperparameters
#Create range of values to fit parameters
loss_param = ['hinge', 'squared_hinge']
penalty_param = ['elasticnet', 'l2' ]
alpha_param = [1, 0.0001, 0.001, 0.01, 0.1]
#Fit parameters
parameters = {'loss': loss_param,
'penalty': penalty_param,
'alpha': alpha_param}
#Fit parameters using gridsearch
SGD_tuned_tfidf = GridSearchCV(SGD_tfidf, param_grid=parameters, n_jobs = -1, cv=kf, verbose = 1)
#Fit the tuned classifier on the training set
SGD_tuned_tfidf.fit(X_train_tfidf, y_train_tfidf)
#Print the best parameters
print(('Best parameters SGD Tfidf:\n {}\n').format(
SGD_tuned_tfidf.best_params_))
The parameters show that the regularization (alpha) continues to be loose as a first option, as the model is a linear classifier trained with gradient descent. Regarding the loss, the hinge loss is selected: the true classification (zero-one) loss, which is hard to optimize directly, is replaced by a convex upper bound, which forces convergence. The time required is significantly higher than in the case of the Naive Bayes classifiers.
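How loose or tight the regularization is can be seen in the size of the learned weights. The sketch below fits `SGDClassifier` with three values of `alpha` on synthetic data and records the norm of the coefficient vector. The data and the name `coef_norms` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic binary problem (assumption, not the notebook's data)
X_demo, y_demo = make_classification(n_samples=400, n_features=30,
                                     n_informative=10, random_state=0)

coef_norms = {}
for alpha in [0.0001, 0.01, 1.0]:
    clf = SGDClassifier(loss='hinge', penalty='l2', alpha=alpha,
                        max_iter=1000, tol=1e-3, random_state=0)
    clf.fit(X_demo, y_demo)
    # A larger alpha penalizes large weights more strongly
    coef_norms[alpha] = float(np.linalg.norm(clf.coef_))

for alpha, norm in coef_norms.items():
    print('alpha={:<7} ||w||={:.3f}'.format(alpha, norm))
```

A small `alpha` leaves the weights almost unconstrained (loose regularization), while a large `alpha` shrinks them towards zero.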
In [41]:
#Refit the tuned grid search on the test dataset (the predictions below are
#therefore made on data the model has already seen)
SGD_tuned_tfidf.fit(X_test_tfidf, y_test_tfidf)
# Predict on test set
predtest_y_tfidf = SGD_tuned_tfidf.predict(X_test_tfidf)
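This model presents overfitting: precision and recall are equal to one for every class. Because the grid search is refit on the test set before predicting, these scores are measured on data the model has already seen and do not reflect its performance on new data. The confusion matrix shows a low number of false negatives and false positives per class, with the classes more or less evenly represented except for class one.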
In [42]:
#Evaluation of the model (testing)
target_names = ['0.0', '1.0', '2.0', '3.0']
print(('Classification Report Tfidf: \n {}\n').format(
classification_report(y_test_tfidf, predtest_y_tfidf,
target_names=target_names)))
confusion_tfidf = confusion_matrix(y_test_tfidf, predtest_y_tfidf)
print((
'Confusion Matrix Tfidf: \n\n {}\n\n').format(confusion_tfidf))
print((
'SGD accuracy Tfidf: {0:.2f} %\n'
).format(cross_val_score(SGD_tuned_tfidf, X_test_tfidf, y_test_tfidf,cv=kf).mean()*100))
In this case, the overall accuracy is 80.78%, very similar to the overall accuracy obtained using the multinomial classifier. The computational effort required by this model to achieve this accuracy is much higher than in the case of the multinomial classifier. Hence, this model would not be recommended for production despite its high accuracy.
Bag of Words
The hyperparameters of the random forest model have been tuned one by one. Compared with tuning them all at once, the proposed method (one by one) gave a significant increase in the overall performance of the classifier. The parameters to be tuned are (in the same order as the hyperparameter tuning has been performed):
n_estimators, determining the number of trees that will be part of the algorithm.
max_depth, determining the size of each tree.
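A minimal sketch of what tuning one parameter at a time could look like is given below: the number of trees is searched first, then fixed while the depth is searched. The synthetic data, the small parameter ranges and the names `step1`, `step2`, `best_n` and `best_depth` are assumptions chosen to keep the example fast; they are not the values used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic four-class problem (assumption, not the notebook's data)
X_demo, y_demo = make_classification(n_samples=200, n_features=20,
                                     n_informative=8, n_classes=4,
                                     random_state=0)
kf_demo = KFold(n_splits=3, shuffle=True, random_state=0)

# Step 1: tune only the number of trees
step1 = GridSearchCV(
    RandomForestClassifier(class_weight='balanced', random_state=0),
    param_grid={'n_estimators': [20, 40, 60]}, cv=kf_demo)
step1.fit(X_demo, y_demo)
best_n = step1.best_params_['n_estimators']

# Step 2: keep the best number of trees fixed and tune only the depth
step2 = GridSearchCV(
    RandomForestClassifier(class_weight='balanced', n_estimators=best_n,
                           random_state=0),
    param_grid={'max_depth': [3, 6, 9]}, cv=kf_demo)
step2.fit(X_demo, y_demo)
best_depth = step2.best_params_['max_depth']

print('n_estimators={}, max_depth={}'.format(best_n, best_depth))
```

Sequential tuning fits far fewer models than a joint grid, at the cost of missing interactions between the two parameters.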
In [49]:
# Initialize and fit the model.
rf_bow = RandomForestClassifier(class_weight = 'balanced')
#Tune hyperparameters
#Create range of values to fit parameters
n_estimators_param = np.arange(250,401,20)
max_depth_param = np.arange(46,63,2)
#Fit parameters
parameters = {'n_estimators': n_estimators_param,
'max_depth': max_depth_param}
#Fit parameters using gridsearch
rf_tuned_bow = GridSearchCV(rf_bow, param_grid=parameters, n_jobs = -1, cv=kf, verbose = 1)
#Fit the tuned classifier on the training set
rf_tuned_bow.fit(X_train_bow, y_train_bow)
#Print the best parameters
print(('Best parameters Random Forest BoW:\n {}\n').format(rf_tuned_bow.best_params_))
The tuned model is fit and run on the test set.
In [50]:
#Refit the tuned grid search on the test dataset (the predictions below are
#therefore made on data the model has already seen)
rf_tuned_bow.fit(X_test_bow, y_test_bow)
# Predict on test set
predtest_y_bow = rf_tuned_bow.predict(X_test_bow)
The overall accuracy of the model has increased significantly compared to the previous classifiers, reaching 73%. However, this result is low for the type of classifier used, and it is lower than the results obtained with other classifiers. In this case, author seven is the one that is decreasing the overall accuracy.
In [51]:
#Evaluation of the model (testing)
target_names = ['0.0', '1.0', '2.0', '3.0']
print(('Classification Report BoW: \n {}\n').format(
classification_report(y_test_bow, predtest_y_bow,
target_names=target_names)))
confusion_bow = confusion_matrix(y_test_bow, predtest_y_bow)
print((
'Confusion Matrix BoW: \n\n {}\n\n').format(confusion_bow))
print((
'Random Forest accuracy BoW: {0:.2f} %\n'
).format(cross_val_score(rf_tuned_bow, X_test_bow, y_test_bow,cv=kf).mean()*100))
This classifier requires more time to run than the Naive Bayes ones and yields poorer results. Author three is the one that is reducing the overall accuracy.
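The author that pulls the overall accuracy down can be read off the confusion matrix by computing the recall of each class (diagonal divided by row sums). The sketch below does this on a small hand-made set of labels. The arrays `y_true_demo` and `y_pred_demo` are invented for illustration and are not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels for four classes, four samples each (assumption)
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
y_pred_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0, 2, 2, 2, 2, 3, 1, 0, 3])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)

# Recall per class: correctly predicted samples / samples of that class
per_class_recall = cm.diagonal() / cm.sum(axis=1)
weakest_class = int(np.argmin(per_class_recall))

print(cm)
print('Recall per class: {}'.format(per_class_recall))
print('Class with the lowest recall: {}'.format(weakest_class))
```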
Tf-idf
The hyperparameters of the random forest model have been tuned one by one. Compared with tuning them all at once, the proposed method (one by one) gave a significant increase in the overall performance of the classifier. The parameters to be tuned are (in the same order as the hyperparameter tuning has been performed):
n_estimators, determining the number of trees that will be part of the algorithm.
max_depth, determining the size of each tree.
In [52]:
# Initialize and fit the model.
rf_tfidf = RandomForestClassifier(class_weight = 'balanced')
#Tune hyperparameters
#Create range of values to fit parameters
n_estimators_param = np.arange(100,201,10)
max_depth_param = np.arange(50,71,5)
#Fit parameters
parameters = {'n_estimators': n_estimators_param,
'max_depth': max_depth_param}
#Fit parameters using gridsearch
rf_tuned_tfidf = GridSearchCV(rf_tfidf, param_grid=parameters, n_jobs = -1, cv=kf, verbose = 1)
#Fit the tuned classifier on the tf-idf training set
rf_tuned_tfidf.fit(X_train_tfidf, y_train_tfidf)
#Print the best parameters
print(('Best parameters Random Forest Tfidf:\n {}\n').format(
rf_tuned_tfidf.best_params_))
The tuned model is fit and run on the test set.
In [53]:
#Refit the tuned grid search on the test dataset (the predictions below are
#therefore made on data the model has already seen)
rf_tuned_tfidf.fit(X_test_tfidf, y_test_tfidf)
# Predict on test set
predtest_y_tfidf = rf_tuned_tfidf.predict(X_test_tfidf)
The overall accuracy of the model has increased significantly compared to the previous classifiers, reaching 73%. However, this result is low for the type of classifier used, and it is lower than the results obtained with other classifiers. In this case, author seven is the one that is decreasing the overall accuracy.
In [54]:
#Evaluation of the model (testing)
target_names = ['0.0', '1.0', '2.0', '3.0']
print(('Classification Report Tfidf: \n {}\n').format(
classification_report(y_test_tfidf, predtest_y_tfidf,
target_names=target_names)))
confusion_tfidf = confusion_matrix(y_test_tfidf, predtest_y_tfidf)
print((
'Confusion Matrix Tfidf: \n\n {}\n\n').format(confusion_tfidf))
print((
'Random Forest accuracy Tfidf: {0:.2f} %\n'
).format(cross_val_score(rf_tuned_tfidf, X_test_tfidf, y_test_tfidf,cv=kf).mean()*100))
This classifier requires more time to run than the Naive Bayes ones and yields poorer results. Author three is the one that is reducing the overall accuracy.
Bag of Words
A linear support vector classifier has been set up and tuned on the training data and run on the test set. The hyperparameters that have been tuned are:
The C parameter, which acts on the margin of the hyperplane: the smaller C is, the bigger the margin. (The value of C tells the SVM how much misclassification is to be avoided.)
The loss parameter.
In this case the Crammer-Singer algorithm is used to solve the multiclass classification problem. This algorithm optimizes a joint objective over all classes, but it is not interesting from a production standpoint as it rarely leads to better accuracy and is more expensive to compute. Note that scikit-learn ignores the loss, penalty and dual options when multi_class='crammer_singer' is chosen, so the grid over the loss does not change the fitted model. Because of the size of the feature space and computational restrictions, the linear SVC has been used instead of the SVC.
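The relationship between C and the margin can be illustrated on a toy binary problem, where the margin width of a linear SVM is 2 / ||w||. The sketch below fits `LinearSVC` with a very small and a very large C on two overlapping Gaussian blobs. The data and the name `margin_width` are assumptions for illustration, and the default one-vs-rest formulation is used here rather than Crammer-Singer.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Two overlapping Gaussian blobs (assumption, for illustration only)
rng = np.random.RandomState(0)
X_demo = np.vstack([rng.normal(-1.0, 1.0, (100, 2)),
                    rng.normal(1.0, 1.0, (100, 2))])
y_demo = np.array([0] * 100 + [1] * 100)

margin_width = {}
for C in [0.001, 100]:
    # dual=False solves the primal problem (squared hinge loss, L2 penalty)
    clf = LinearSVC(C=C, dual=False, max_iter=10000, random_state=0)
    clf.fit(X_demo, y_demo)
    # Margin width of a linear SVM: 2 / ||w||
    margin_width[C] = 2.0 / np.linalg.norm(clf.coef_)

for C, width in margin_width.items():
    print('C={:<6} margin width={:.3f}'.format(C, width))
```

A small C tolerates more misclassified training points in exchange for a wider margin; a large C narrows the margin to fit the training data more closely.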
In [60]:
# Initialize and fit the model.
LSVC_bow = LinearSVC(class_weight='balanced', multi_class = 'crammer_singer')
#Tune hyperparameters
#Create range of values to fit parameters
loss_param = ['hinge','squared_hinge']
C_param = [1, 10, 100, 100000]
#Fit parameters
parameters = { 'loss': loss_param,
'C': C_param}
#Fit parameters using gridsearch
LSVC_tuned_bow = GridSearchCV(LSVC_bow, param_grid=parameters, n_jobs = -1, cv=kf, verbose = 1)
#Fit the tuned classifier on the training set
LSVC_tuned_bow.fit(X_train_bow, y_train_bow)
#Print the best parameters
print(('Best parameters LinearSVC BoW:\n {}\n').format(
LSVC_tuned_bow.best_params_))
Once the parameters have been tuned, the model is fit on the test dataset.
In [61]:
#Refit the tuned grid search on the test dataset (the predictions below are
#therefore made on data the model has already seen)
LSVC_tuned_bow.fit(X_test_bow, y_test_bow)
# Predict on test set
predtest_y_bow = LSVC_tuned_bow.predict(X_test_bow)
Although it requires more computational effort, it presents better results than the previous algorithms. In this case, nearly 73% has been achieved, competing against the multiclass algorithm in terms of accuracy but not in terms of computational effort.
In [62]:
#Evaluation of the model (testing)
target_names = ['0.0', '1.0', '2.0', '3.0']
print(('Classification Report BoW: \n {}\n').format(
classification_report(y_test_bow, predtest_y_bow,
target_names=target_names)))
confusion_bow = confusion_matrix(y_test_bow, predtest_y_bow)
print((
'Confusion Matrix BoW: \n\n {}\n\n').format(confusion_bow))
print((
'Linear SVC accuracy BoW: {0:.2f} %\n'
).format(cross_val_score(LSVC_tuned_bow, X_test_bow, y_test_bow,cv=kf).mean()*100))
The algorithm presents overfitting, as can be seen from the classification report. Recall and precision are reported as one because the model is refit on the test set before predicting; in reality they are lower than one, with a cross-validated overall accuracy of 79.37%. Furthermore, the time required to fit the dataset is higher than the one required with the Naive Bayes algorithms.
Tf-idf
A linear support vector classifier has been set up and tuned on the training data and run on the test set. The hyperparameters that have been tuned are:
The C parameter, which acts on the margin of the hyperplane: the smaller C is, the bigger the margin. (The value of C tells the SVM how much misclassification is to be avoided.)
The loss parameter.
In this case the Crammer-Singer algorithm is used to solve the multiclass classification problem. This algorithm optimizes a joint objective over all classes, but it is not interesting from a production standpoint as it rarely leads to better accuracy and is more expensive to compute. Note that scikit-learn ignores the loss, penalty and dual options when multi_class='crammer_singer' is chosen, so the grid over the loss does not change the fitted model. Because of the size of the feature space and computational restrictions, the linear SVC has been used instead of the SVC.
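The trade-off between the Crammer-Singer formulation and the default one-vs-rest strategy can be explored with a side-by-side fit. The sketch below trains `LinearSVC` with both `multi_class` options on synthetic four-class data and records fit time and held-out accuracy. The data and the dictionary `comparison` are assumptions for illustration; the numbers depend on the machine and do not describe this notebook's corpus.

```python
from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic four-class problem (assumption, not the notebook's data)
X_demo, y_demo = make_classification(n_samples=400, n_features=30,
                                     n_informative=10, n_classes=4,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0)

comparison = {}
for strategy in ['ovr', 'crammer_singer']:
    clf = LinearSVC(multi_class=strategy, max_iter=5000, random_state=0)
    start = perf_counter()
    clf.fit(X_tr, y_tr)
    comparison[strategy] = {
        'fit_seconds': perf_counter() - start,
        'accuracy': clf.score(X_te, y_te),
    }

for strategy, result in comparison.items():
    print('{:<15} fit={:.4f} s  accuracy={:.3f}'.format(
        strategy, result['fit_seconds'], result['accuracy']))
```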
In [65]:
# Initialize and fit the model.
LSVC_tfidf = LinearSVC(class_weight='balanced', multi_class = 'crammer_singer')
#Tune hyperparameters
#Create range of values to fit parameters
loss_param = ['hinge','squared_hinge']
C_param = [0.1, 1, 10, 100]
#Fit parameters
parameters = {
'loss': loss_param,
'C': C_param}
#Fit parameters using gridsearch
LSVC_tuned_tfidf = GridSearchCV(LSVC_tfidf, param_grid=parameters, n_jobs = -1, cv=kf, verbose = 1)
#Fit the tuned classifier on the training set
LSVC_tuned_tfidf.fit(X_train_tfidf, y_train_tfidf)
#Print the best parameters
print(('Best parameters Linear SVC Tfidf:\n {}\n').format(LSVC_tuned_tfidf.best_params_))
Once the parameters have been tuned, the model is fit on the test dataset.
In [66]:
#Refit the tuned grid search on the test dataset (the predictions below are
#therefore made on data the model has already seen)
LSVC_tuned_tfidf.fit(X_test_tfidf, y_test_tfidf)
# Predict on test set
predtest_y_tfidf = LSVC_tuned_tfidf.predict(X_test_tfidf)
Although it requires more computational effort, it presents better results than the previous algorithms. In this case, nearly 79% has been achieved, competing against the multiclass algorithm in terms of accuracy but not in terms of computational effort.
In [67]:
#Evaluation of the model (testing)
target_names = ['0.0', '1.0', '2.0', '3.0']
print(('Classification Report Tfidf: \n {}\n').format(
classification_report(y_test_tfidf, predtest_y_tfidf,
target_names=target_names)))
confusion_tfidf = confusion_matrix(y_test_tfidf, predtest_y_tfidf)
print((
'Confusion Matrix Tfidf: \n\n {}\n\n').format(confusion_tfidf))
print((
'Linear SVC accuracy Tfidf: {0:.2f} %\n'
).format(cross_val_score(LSVC_tuned_tfidf, X_test_tfidf, y_test_tfidf,cv=kf).mean()*100))
The algorithm presents overfitting, as can be seen from the classification report. Recall and precision are reported as one because the model is refit on the test set before predicting; in reality they are lower than one, with a cross-validated overall accuracy of 79.37%. Furthermore, the time required to fit the dataset is higher than the one required with the Naive Bayes algorithms.
The accuracy of all of the models has been improved in the capstone project. The steps taken to achieve this improvement are:
The results obtained once all the steps have been taken are:
From the initial set of results obtained in this challenge:
Of all the improvements made, I highlight the one in the SGD classifier, which goes from 80.78% to 87.12%. The changes made to the model can be seen in the capstone project.